Architectural choices for the Columbia 0.8 Teraflops machine

I. V. Arsenin^a*

^a Physics Department, Columbia University,
New York, NY 10027, U.S.A. 

We discuss the hardware design choices made in our 16K-node 0.8 Teraflops supercomputer project, a ma-
chine architecture optimized for full QCD calculations. The efficiency of the conjugate gradient algorithm in
terms of balance of floating-point operations, memory handling and utilization, and communication overhead is 
addressed. We also discuss the technological innovations and software tools that facilitate hardware design and 
what opportunities these give to the academic community. 



1. OVERVIEW 

The preliminary discussions of the Columbia 
0.8 Teraflops project began in Spring 1993. Since 
1989 the Columbia group has performed lattice 
QCD calculations on the Columbia 256-node par- 
allel computer with 16 Gflops peak speed. It be- 
came clear by that time that the capacity of this 
computer would be exhausted within a few
years and that it was necessary to look into ways
to enhance the computing power in order to stay 
at the forefront of lattice calculations. 

The timing of our decision to embark on the 
building of a new supercomputer was greatly in- 
fluenced by advances in technology that have 
produced inexpensive and relatively fast proces-
sors, external memory units, and peripherals
that could be customized for the desired tasks. 
Our choice was in favor of a machine that con- 
sists of a large number of simple nodes connected 
by an efficient communication network — a con- 
figuration, we believe, more promising and cost 
efficient than a few high speed and high capacity 
processors which have yet to become available at 
reasonable prices. 

Figure 1. The functional diagram of one of the
16K nodes.

The Columbia 0.8 Teraflops computer is a col-
lection of 16,384 nodes connected in a 16^3 x 4
four-dimensional, serial network. The main ele-
ments of the node (Fig. 1) are:

• Digital Signal Processor (DSP) capable of 
performing a floating-point multiplication
and an accumulation in a single 25MHz cy- 
cle. We are currently using Texas Instru- 
ments TMS320C30 50MHz parts. For more 
information on DSP applications see Q. 

• Dynamic Random Access Memory 
(DRAM). We are supporting either 256K x
16 or 512K x 8 parts from various vendors.
These memory devices are connected to a 
39-bit data bus (32-bit data word with 7 
bits for Error Detection and Correction). 
• The cornerstone of our design, the NGA
(Node Gate Array), is a customized gate
array (Fig. 2). The NGA provides a con-
troller for the serial communication pro-
tocol, arbitrates memory accesses, handles
recovery from errors in DRAM and se- 
rial wires, and includes a buffer that bal- 
ances data transfers between the DSP and 
DRAM. 
Figure 2. The main components of the NGA.
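
The figures quoted for the node can be cross-checked with a few lines of arithmetic. The Python sketch below is only a consistency check, not part of the design. It assumes that the 7 check bits come from a Hamming-style single-error-correcting, double-error-detecting code, which the text does not state, and the names `secded_check_bits`, `NODES` and `PEAK_FLOPS` are introduced here for illustration.

```python
def secded_check_bits(data_bits):
    """Check bits needed by a Hamming SEC-DED code (assumed code family).

    Single-error correction needs the smallest r with 2**r >= data_bits + r + 1;
    one extra overall parity bit adds double-error detection.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


# 16^3 x 4 four-dimensional network
NODES = 16 ** 3 * 4

# One multiplication and one accumulation per 25 MHz cycle
FLOPS_PER_CYCLE = 2
CYCLES_PER_SECOND = 25_000_000
PEAK_FLOPS = NODES * FLOPS_PER_CYCLE * CYCLES_PER_SECOND

# 32-bit data word plus check bits
BUS_WIDTH = 32 + secded_check_bits(32)

print(NODES, PEAK_FLOPS / 1e12, BUS_WIDTH)
```

Under these assumptions the node count, the roughly 0.8 Teraflops peak speed and the 39-bit bus width all agree with the numbers given above.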



The opportunity to include almost all I/O con- 
trollers and global signal handlers in one VLSI 
chip significantly simplified the configuration of 
the node and at the same time added a new level 
of flexibility to the hardware. In designing the 
NGA, we tried to shift as much of the low level 
functionality as possible to the hardware thus al- 
lowing the programmer to concentrate more on 
the implementation of the physics algorithms. At 
the same time we are leaving enough hooks for a 
sophisticated user concerned with getting maxi- 
mum efficiency. 

The design of a truly parallel computer requires 
a well balanced communication scheme with the 
throughput matching the speed of the CPU. The 
optimal communications vs. computations ratio 
depends on the algorithms to be run on the com- 
puter. Our task was to produce a Lattice QCD 
machine able to run full QCD calculations effi-
ciently. Our benchmark algorithm for determin-
ing balance with the serial communications and
memory accesses is a single conjugate gradient 
(CG) update from a staggered fermion evolution. 
This is more or less an obvious choice for the full 
QCD calculation since matrix inversions take up 
most of the CPU time. We also used code for 
Wilson fermions which, though less communica-
tions intensive, provides a useful tool for estimat-
ing overall performance. A serial communication
scheme with transfers going in four "space-time"
directions does not create substantial overhead 
for vector-matrix multiplications if run at a speed 
exceeding ITMHz. On the other hand, global dot 
products that require the sum of the spinors on 
all nodes would take up to 70% of the CPU time if 
serial transfers were run at 25MHz. Our SCU (Se-
rial Communication Unit) thus has been designed
to perform "on-the-fly" add, max, and broadcast 
operations between the data residing on a partic- 
ular node and the data coming from the adjacent 
nodes. This mode significantly reduces the la- 
tency of transferring information across the com- 
puter. 
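
The effect of combining data as it passes through each node can be illustrated with a small software model. The Python sketch below is not the SCU logic. It assumes a periodic grid and reduces along one direction at a time, so that every node ends up holding the global sum; the function name `global_sum` and the reduced 4 x 4 x 4 x 2 grid are hypothetical choices made to keep the example small.

```python
from itertools import product


def global_sum(values, shape):
    """Axis-by-axis global sum on a periodic grid (illustrative model).

    `values` maps each node coordinate to its local number. For every axis,
    each node receives the sum over the ring of nodes that differ from it
    only along that axis, as if the data were added while passing through.
    """
    current = dict(values)
    for axis, length in enumerate(shape):
        updated = {}
        for coord in current:
            total = 0
            for k in range(length):
                peer = coord[:axis] + (k,) + coord[axis + 1:]
                total += current[peer]
            updated[coord] = total
        current = updated
    return current


# Hypothetical small grid standing in for the 16 x 16 x 16 x 4 machine
shape = (4, 4, 4, 2)
coords = list(product(*(range(n) for n in shape)))
local = {coord: index for index, coord in enumerate(coords)}

result = global_sum(local, shape)
print(set(result.values()))
```

After one pass over the four directions every node holds the same total, which is the property the global dot product of the conjugate gradient needs.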

Another aspect of achieving balance among dif- 
ferent components of the node is managing slow 
DRAM. This is done by introducing a pipeline 
stage in the path between the DSP and the 
memory. A 32-word Circular Buffer can be in- 
structed to prefetch a certain number of words 
from DRAM which can be read by the DSP in 
zero wait state mode later on. The Circular 
Buffer has an embedded protection against ac- 
cesses to invalid data locations unless the pro- 
grammer specifically turns this protection off. 
The size of the Circular Buffer comfortably ac-
commodates all eighteen elements of an SU(3)
color matrix.
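
The behaviour of such a prefetch stage can be sketched in software. The Python model below is a simplified illustration under assumptions of ours, not a description of the NGA: the class name `CircularBuffer`, its methods and the way an invalid access is reported are all hypothetical.

```python
class CircularBuffer:
    """Toy model of a 32-word prefetch buffer between DSP and DRAM."""

    SIZE = 32

    def __init__(self, dram, protect=True):
        self.dram = dram              # list standing in for the DRAM
        self.slots = [0] * self.SIZE
        self.head = 0                 # next slot the DSP will read
        self.tail = 0                 # next slot a prefetch will fill
        self.valid = 0                # number of prefetched, unread words
        self.protect = protect

    def prefetch(self, address, count):
        """Copy `count` words starting at `address` from DRAM into the buffer."""
        if self.valid + count > self.SIZE:
            raise ValueError("prefetch would overwrite unread words")
        for offset in range(count):
            self.slots[self.tail] = self.dram[address + offset]
            self.tail = (self.tail + 1) % self.SIZE
        self.valid += count

    def read(self):
        """Return the next prefetched word without touching DRAM."""
        if self.valid == 0:
            if self.protect:
                raise RuntimeError("read from an invalid buffer location")
            # Protection off: hand back whatever stale word the slot holds.
            return self.slots[self.head]
        word = self.slots[self.head]
        self.head = (self.head + 1) % self.SIZE
        self.valid -= 1
        return word


# Prefetch the 18 real numbers of one SU(3) matrix, then read them back.
dram = list(range(100, 200))
buffer = CircularBuffer(dram)
buffer.prefetch(0, 18)
matrix_words = [buffer.read() for _ in range(18)]
print(matrix_words)
```

In this model a nineteenth read raises an error because no valid word is left, unless the buffer was created with `protect=False`.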

2. MODERN DESIGN TOOLS 

The NGA incorporates almost all the nontriv-
ial functionality that makes a collection of CPUs
and memory devices a parallel computer. Please
refer to 0] or ||] for the description of the upper 
level design blocks of our project. In this section 
we will concentrate on the implementation of the 
ideas described in the previous section.

Not long ago, at the final stages of the de-
sign process, one had to produce schematic draw-
ings, then solder available logic elements and
more sophisticated standardized medium integra-
tion components on printed circuit boards, and
test the result using a logic analyzer. The de-
sign process was time consuming and hardly feasi-
ble outside large companies or research centers. A
major advance occurred a few years ago with the 
advent of hardware description languages, such as 
VHDL and Verilog, that allow a designer to for- 
malize some higher level concepts and provide a 
high degree of automation in transforming this 
description into an actual schematic accepted by 
VLSI manufacturers. 

The basic entities that VHDL (our language 
of choice) describes are signals and their sequen- 
tial and concurrent assignments. One can write 
boolean style logic equations connecting incom- 
ing signals such as the DSP address bus with 
outgoing signals such as serial wires constituting 
the communication network or controls for pe-
ripheral devices. Finite state machines, which re-
act to the incoming signals according to the state
the machine is in, are naturally implemented as 
clocked processes where the state is represented 
by an internal signal that is allowed to change 
value only at a specific edge of the clock. The 
following is an example of a two-bit counter: 

    -- both assignments use the signal values from before the clock edge
    wait until prising(clock);
    count0 <= not count0;
    count1 <= count1 xor count0;
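
The counter works because every signal assignment triggered by the clock edge reads the values the signals had before that edge. The Python sketch below mimics this behaviour for illustration only; the function name `clock_edge` and the starting state are our own choices.

```python
def clock_edge(count1, count0):
    """Advance the two-bit counter by one rising clock edge.

    Both new values are computed from the old ones, as with VHDL
    signal assignments inside a clocked process.
    """
    new_count0 = 1 - count0          # count0 <= not count0
    new_count1 = count1 ^ count0     # count1 <= count1 xor count0
    return new_count1, new_count0


state = (0, 0)
sequence = []
for _ in range(5):
    state = clock_edge(*state)
    sequence.append(2 * state[0] + state[1])

print(sequence)  # [1, 2, 3, 0, 1]
```

If `count1` were instead computed from the already updated `count0`, the sequence would no longer count in order, which is why the ordering semantics of signal assignments matter.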

Once the desired logical equations are writ- 
ten, the design can be simulated as a black box 
which reacts to the external stimuli represented 
by time patterns for each incoming signal. We
also use models of standard components such
as DSPs and DRAMs, available from the Logic
Modeling Corporation. These allow us to check the
logical consistency of a design by running a piece
of CG code on a model of the DSP, which com-
municates with our model of the NGA, which in
turn reads from and writes to a model of the memory.

The next important step is synthesis of the
VHDL source code, the output of which is the
desired schematic. This process is similar to com-
piling an ordinary program. The gate array man-
ufacturer supplies a library of components which
represent basic constructs in VHDL. The synthe- 
sizer parses the code and substitutes these con- 
structs with the elements of the library, properly 
connecting them to each other. After the initial 
run, the synthesizer proceeds to the iterative op- 
timization task by looking at larger clusters of 
components and trying to reduce them to more 
compact ones following the list of constraints such 
as speed, size, or load requirements set by the 
user. The development of efficient synthesis tools 
is very much in progress now and the current ones 
are not perfect. Some human intervention is still 
required to produce sound results. For more in- 
formation on this aspect of design see |^ . 

The final step in the design is the simulation 
which includes correct propagation delays sup- 
plied by the vendor. At this stage one can identify 
parts of the design that are too slow to keep up 
with the desired speed of the computer and opti- 
mize them either by rearranging the VHDL code 
or by applying more severe constraints to the syn- 
thesizer. There are a number of tools available for 
this: one can list the asynchronous paths with 
delays larger than a set number, include such sig- 
nals in a watch list from the simulator, look at 
the diagnostic output of the synthesizer, and so 
on. At the same time one can run large pieces of 
assembly code on the model of the DSP, checking 
the performance of physics code on a cycle by cy- 
cle basis and revealing obscure bugs that might 
otherwise show up only after thousands of cycles 
of simulation. The performance figures for the 
Columbia 0.8 Teraflops project are based on these 
simulations and presented in reference j^. 
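
Listing the paths whose delay exceeds a set number amounts to a longest-path computation over the gate netlist. The Python sketch below illustrates the idea only; the four-gate netlist, its delays and the 6 ns limit are invented for the example, and the function names are hypothetical rather than those of any vendor tool.

```python
def arrival_times(drivers, delay_ns):
    """Latest signal arrival time at the output of every gate.

    `drivers` maps a gate to the gates feeding its inputs; gates without
    an entry are driven directly by primary inputs arriving at time 0.
    """
    memo = {}

    def arrive(gate):
        if gate not in memo:
            latest_input = max(
                (arrive(source) for source in drivers.get(gate, [])),
                default=0.0,
            )
            memo[gate] = latest_input + delay_ns[gate]
        return memo[gate]

    for gate in delay_ns:
        arrive(gate)
    return memo


def slow_gates(drivers, delay_ns, limit_ns):
    """Gates whose output settles later than `limit_ns`."""
    times = arrival_times(drivers, delay_ns)
    return sorted(gate for gate, time in times.items() if time > limit_ns)


# Invented netlist: and1 feeds xor1 and or1; xor1 and and1 feed mux1.
drivers = {"xor1": ["and1"], "mux1": ["xor1", "and1"], "or1": ["and1"]}
delay_ns = {"and1": 2.0, "xor1": 3.0, "mux1": 4.0, "or1": 1.5}

print(arrival_times(drivers, delay_ns))
print(slow_gates(drivers, delay_ns, limit_ns=6.0))  # ['mux1']
```

A gate reported here would be a candidate for rearranging the source code or for tighter synthesis constraints, as described above.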